Wrangling Data 3: (factors, dates and functions)
Matthew Bracher-Smith
October 21, 2024
A factor:
We can create them using the factor() function, which takes the format: factor(vector, levels, labels)
# e.g.
factor(c(0, 1, 1, 1, 0), labels=c('Female', 'Male'))
[1] Female Male Male Male Female
Levels: Female Male
We can make factors that have an inherent order
But sorting them may not give us what we expect
monthLevels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data <- factor(c("Dec", "Jun", "Apr"), levels = monthLevels)
sort(data)
[1] Apr Jun Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Strings that aren’t in your levels are silently set as NA
[1] Dec <NA> Apr
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Factors also provide an automatic error control, something that is extremely useful when programming and managing large amounts of data. However, we may not want such NAs to go unnoticed!
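The silent conversion described above can be sketched as follows. The misspelt value "Jum" is taken from the parse_factor() output in this section; the variable names and the anyNA() check are illustrative assumptions, not the original code.

```r
# Month levels as defined earlier in this section
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# "Jum" is a typo, so it is not one of the levels
typo_data <- factor(c("Dec", "Jum", "Apr"), levels = month_levels)

typo_data                 # the second element prints as <NA>, with no warning
anyNA(typo_data)          # TRUE: a cheap way to notice the problem
which(is.na(typo_data))   # position of the unmatched value
```

Checking for NA straight after creating a factor is one simple way to stop such values going unnoticed.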
By contrast, readr’s parse_factor() will warn you
Warning: 1 parsing failure.
row col expected actual
2 -- value in level set Jum
[1] Dec <NA> Apr
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 2 NA value in level set Jum
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
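A call along the following lines would produce a warning like the one shown above. This is a sketch that assumes the readr package is installed; the exact wording of the warning can differ between readr versions, and the variable names are illustrative.

```r
library(readr)

month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Unlike factor(), parse_factor() warns about the misspelt "Jum"
parsed <- parse_factor(c("Dec", "Jum", "Apr"), levels = month_levels)

parsed             # the unmatched value is still stored as NA
problems(parsed)   # a tibble describing each parsing failure
```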
If you ever need to access the set of valid levels directly, you can do so with levels(). The function nlevels() can also be used to show the number of levels. Factors allow us to set a semantic order rather than the alphabetical one associated with character vectors. This can be quite handy when our categorical data has an inherent idea of order, like the months of the year or the quality level of a diamond cut.
[1] "moony" "padfoot" "prongs" "wormtail"
[1] "stag" "dog" "otter"
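As a small sketch of levels(), nlevels() and semantic ordering, here is a hypothetical factor of sizes (it is not the factor that produced the output above):

```r
# A hypothetical factor whose levels have a natural order
sizes <- factor(c("medium", "large", "small", "large"),
                levels = c("small", "medium", "large"))

levels(sizes)    # "small" "medium" "large"
nlevels(sizes)   # 3

sort(sizes)                 # follows the level order: small medium large large
sort(as.character(sizes))   # alphabetical: "large" "large" "medium" "small"
```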
Using diamonds
Use dplyr::count() to show how many rows there are for each factor level in the ‘cut’ column. Is it any different to using forcats::fct_count()?
The only differences here are that dplyr takes a dataframe or tibble, while forcats takes a vector, and some formatting of the output. Both provide useful counts of the levels in our factor. This is often one of the first things we do when encountering a factor!
Use dplyr::arrange(desc()) to sort the cut column in descending order. What were the rows sorted by?
Our output from arrange() is sorted by the order of the levels of our factor. This is different to strings in character vectors, which are sorted alphabetically. We therefore need to be careful when sorting to check our output is as expected!
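One possible way to work through the two diamonds tasks is sketched below. It assumes dplyr, forcats and ggplot2 (which provides diamonds) are installed; the object names are illustrative.

```r
library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset

# dplyr::count() takes the whole tibble and a column name
cut_counts_dplyr <- diamonds |> count(cut)

# forcats::fct_count() takes the factor vector itself
cut_counts_forcats <- fct_count(diamonds$cut)

cut_counts_dplyr
cut_counts_forcats

# Sorting in descending order follows the level order, not the alphabet
sorted_by_cut <- diamonds |> arrange(desc(cut))
head(sorted_by_cut$cut)
```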
Why should I care about factors?
contrasts() function.
Forcats is an excellent package for dealing with factors because:
It is, however, strictly for humans.
It is now a common task to extend our datasets based on new results or changes in the criteria used, or to rearrange the levels to improve the readability of our data when plotted. It is therefore handy to know that we can change both the order and the levels of an existing factor, and how to do it.
For some of these examples we are going to use forcats::gss_cat, a dataset created from a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
forcats::fct_relevel() and forcats::fct_inorder()
fct_inorder()
[1] Never married Divorced Widowed Never married Divorced
[6] Married
Levels: Never married Divorced Widowed Married Separated No answer
fct_relevel()
[1] Never married Divorced Widowed Never married Divorced
[6] Married
Levels: Married No answer Never married Separated Divorced Widowed
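The difference between the two functions can be sketched on a tiny, hypothetical factor (this is not the gss_cat code that produced the output above; it assumes forcats is installed):

```r
library(forcats)

# By default, levels are sorted alphabetically
letters_factor <- factor(c("b", "c", "a", "b"))
levels(letters_factor)                     # "a" "b" "c"

# fct_inorder() orders levels by first appearance in the data
levels(fct_inorder(letters_factor))        # "b" "c" "a"

# fct_relevel() moves the named level(s) to the front by hand
levels(fct_relevel(letters_factor, "c"))   # "c" "a" "b"
```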
forcats::fct_recode()
For changing the names of existing levels by hand
myFactor <- factor(c("M", "F", "O", "M", "P", "M",
                     "F", "F", "F", "M", "O", "P"))
myFactorPub <- fct_recode(myFactor, male = "M", female = "F",
                          unknown = "O", unknown = "P")
myFactorPub
 [1] male female unknown male unknown male female female female
[10] male unknown unknown
Levels: female male unknown
forcats::fct_reorder()
Compare the output of the two graphs below:
forcats::fct_reorder():
It can be applied within a ggplot2 call, like below:
It could also be used before the ggplot2 using mutate, like below. In practice, it’s much more common to reorder factors for plotting ‘on the fly’ by doing it inside the ggplot2 call
Using gss_cat
[1] "Southern baptist" "Baptist-dk which" "No denomination"
[4] "Not applicable" "Lutheran-mo synod" "Other"
[7] "United methodist" "Episcopal" "Other lutheran"
[10] "Afr meth ep zion" "Am bapt ch in usa" "Other methodist"
[13] "Presbyterian c in us" "Methodist-dk which" "Nat bapt conv usa"
[16] "Am lutheran" "Nat bapt conv of am" "Am baptist asso"
[19] "Evangelical luth" "Afr meth episcopal" "Lutheran-dk which"
[22] "Luth ch in america" "Presbyterian, merged" "No answer"
[25] "Wi evan luth synod" "Other baptists" "Other presbyterian"
[28] "United pres ch in us" "Presbyterian-dk wh" "Don't know"
# using fct_reorder() inside the ggplot2 call
gss_cat |>
  group_by(rincome) |>
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()) |>
  ggplot(aes(age, fct_reorder(rincome, age))) +
  geom_point()
# using fct_reorder() inside mutate(), before the ggplot2 call
gss_cat |>
  group_by(rincome) |>
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()) |>
  mutate(rincome = fct_reorder(rincome, age)) |>
  ggplot(aes(age, rincome)) +
  geom_point()
# using fct_relevel() to move 'Not applicable' to the front
gss_cat |>
  group_by(rincome) |>
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()) |>
  ggplot(aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `rincome = fct_reorder(rincome, age)`.
Caused by warning:
! `fct_reorder()` removing 76 missing values.
ℹ Use `.na_rm = TRUE` to silence this message.
ℹ Use `.na_rm = FALSE` to preserve NAs.
While both of these look nice, the second one is probably more appropriate. This is because our y axis has some inherent order to it, and changing this makes understanding the plot a bit more difficult. Moving ‘Not applicable’ to the beginning to be with similar categories is helpful, however.
forcats::fct_reorder2()
Compare the two plots below:
forcats::fct_reorder2()
forcats::fct_reorder()
You can see its use below:
Other useful Forcats functions
fct_rev() reverses the order of the levels
fct_lump() combines the least common factor levels into ‘other’
fct_expand() adds new levels to your factors
fct_relabel() automatically relabels factor levels
fct_infreq() orders factors from most frequent to least frequent
Using gss_cat
head(fct_recode(gss_cat$partyid,
                `Not strong republican` = 'Not str republican',
                `Independent, near democrat` = 'Ind,near dem',
                `Independent, near republican` = 'Ind,near rep'))
[1] Independent, near republican Not strong republican
[3] Independent Independent, near republican
[5] Not str democrat Strong democrat
10 Levels: No answer Don't know Other party ... Strong democrat
Using gss_cat
diamonds |>
  filter(color == 'J', depth > 55, carat <= 2.5) |>
  ggplot(aes(carat, price, col = fct_reorder2(cut, carat, price))) +
  geom_line(alpha = 0.6)
Changing plot attributes like the legend title will be covered at a later date.
At times, it may seem necessary to convert from a factor to numeric, particularly if you are using a plotting function that requires numeric data. However, as.numeric(my_factor) should not be used for these purposes.
Errors may be obvious:
Here, the values we passed are 0 or 1, but R is 1-indexed, meaning all counting starts from 1, not 0. Internally, R represents the coding of this factor as 1/2, even though we passed the values 0/1.
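A minimal sketch of this obvious case, reusing the 0/1 factor from the start of these notes (the variable name is illustrative):

```r
# The values passed in are 0 and 1
sex <- factor(c(0, 1, 1, 1, 0), labels = c("Female", "Male"))

# as.numeric() returns the internal codes, which start at 1
as.numeric(sex)   # 1 2 2 2 1
```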
Errors can also be more subtle:
The latter situation is common if you have ordered categorical data, for example levels of education in a population, coded as numbers 1 to 6. Unfortunately, when we created our factor, we failed to specify the levels. As nobody with the category 4 happens to be present, the internal coding of the factor levels above 4 was shifted down by one. Our output from as.numeric() is not the vector we passed in.
When coercing factors to other types, note
as.numeric() returns R’s internal codes for the factors, not their values
 [1] 1 1 2 5 3 3 1 6 5 1 6 2
Our output now represents the values we passed in!
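The subtle case and its fix can be sketched as follows. The input vector is inferred from the output shown above, and the variable names are assumptions; converting via as.character() (or via the levels) is one common way to recover the original values.

```r
# Education coded 1 to 6, but nobody in category 4 is present
education <- c(1, 1, 2, 5, 3, 3, 1, 6, 5, 1, 6, 2)
education_factor <- factor(education)   # levels not specified

levels(education_factor)       # "1" "2" "3" "5" "6"

# Internal codes: 5 and 6 have been shifted down to 4 and 5
as.numeric(education_factor)   # 1 1 2 4 3 3 1 5 4 1 5 2

# Convert the labels, not the codes, to get the original values back
as.numeric(as.character(education_factor))               # 1 1 2 5 3 3 1 6 5 1 6 2
as.numeric(levels(education_factor))[education_factor]   # same result
```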
YYYY-MM-DD, MM/DD/YYYY, or even DD-MM-YYYY.
ymd() converts strings to dates in the format YYYY-MM-DD (year-month-day)
mdy(), dmy(), ymd_hms() etc.
today() returns the current date
year(), month(), day() extract the year, month, and day from a date object
interval() creates an interval between two dates
time_length(), which calculates the length of an interval in a specified unit
using the lakers dataset which comes with lubridate
some_function below:
using the economics dataset from ggplot2
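The lubridate helpers listed above can be sketched on a few hand-written dates. This assumes the lubridate package is installed; the dates and variable names are illustrative and are not taken from the lakers or economics exercises.

```r
library(lubridate)

# The same day written in three different formats
d_ymd <- ymd("2024-10-21")
d_mdy <- mdy("10/21/2024")
d_dmy <- dmy("21-10-2024")
d_ymd == d_mdy   # TRUE
d_ymd == d_dmy   # TRUE

# Extract components from a date object
year(d_ymd)    # 2024
month(d_ymd)   # 10
day(d_ymd)     # 21

# Length of the interval between two dates, in a chosen unit
span <- interval(ymd("2024-01-01"), ymd("2024-01-15"))
time_length(span, "week")   # 2

today()   # the current date, so the value depends on when this is run
```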
This is the most powerful tool of every data scientist and, at the same time, the hardest to master. Everyone has their own opinions and programming styles, but there is some standardization that you should follow to make your code easier to read for others and, most importantly, for your future self. These two terms should be part of your programmer mantra: readability and reusability.
Functions in R:
As you will have seen in the help pages of other functions, the arguments we expect to be passed to our new function go between the parentheses. The code that will be executed every time we call our function (also known as the body) has to go between {}. Remember to type () after the name of your function in order to execute it. You can also take advantage of the standard return rule: a function returns the last value that it computed. However, it is almost always best to use an explicit return() statement at the end of a function. Code should always be clear. Explicit is better than implicit.
It is also important to know that the variables declared inside the function only exist whilst the function is being executed. Additionally, if we pass a variable created outside the function as one of its arguments, its value will not be changed even if it is edited within the function. This is what is called pass by value.
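A minimal sketch of these points, using a hypothetical add_one() function: the argument goes between the parentheses, the body between {}, and the result is handed back with an explicit return().

```r
# A hypothetical function with one argument and an explicit return()
add_one <- function(x) {
  x <- x + 1          # changes only the copy inside the function
  incremented <- x    # a variable that exists only while the function runs
  return(incremented)
}

value <- 10
add_one(value)          # 11
value                   # still 10: the outside variable is unchanged
exists("incremented")   # FALSE: the inner variable is gone after the call
```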
call your functions after creating them to check their output
return() statement. Choose an appropriate name.
The output of the above two functions is identical. This is because, as mentioned above, R will return the most recently evaluated expression if no return() statement is given. Using an explicit return() statement, as in the first version, is preferred. Explicit is better than implicit.
my_divide() which takes two arguments, ‘x’ and ‘y’ and returns x divided by y. Use an explicit return statement.
my_divide() function so that there is another argument called ‘tol’. Set it to a very low value, and add it to y before dividing
Here we set the default value for tol in function(). This means whenever the function is run, that value is always used for tol unless someone overwrites it by passing a new value in the function call. You can make it a lot easier for someone else to run your functions by setting sensible default values if it is appropriate to do so.
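One way my_divide() with a default tolerance could look is sketched below. The default of 1e-8 is an assumption standing in for "a very low value"; the original answer code is not shown here.

```r
# Divide x by y, adding a small tolerance to y before dividing
my_divide <- function(x, y, tol = 1e-8) {
  return(x / (y + tol))
}

my_divide(6, 3)            # very close to 2, using the default tol
my_divide(6, 3, tol = 0)   # exactly 2, overriding the default
my_divide(1, 0)            # large but finite, instead of Inf
```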
Try to decipher the code below
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
The answer to the first question is that this code rescales each column of the randomly generated dataframe to have a range from 0 to 1. The answer to the second question is trickier: the code finishes successfully, but if you inspect the line corresponding to column b, you might notice that the purpose of the code is not fulfilled. In fact, when copying and pasting the line, I forgot to change the last a to b.
In the previous example, it is easy to spot the code that would be extremely useful to be transformed into a function. The first step once we have extracted the desired part is to detect and replace the variables by arguments:
We can now reimplement the previous code using our new function:
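A sketch of these two steps is given below: the repeated expression becomes a function with a single argument x, which is then applied to each column. The name rescale() follows the discussion later in this section; the rest is an assumption about how the code might look.

```r
# The repeated expression, with the column replaced by the argument x
rescale <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# No column name is typed more than once per line, so the copy-paste slip cannot recur
df$a <- rescale(df$a)
df$b <- rescale(df$b)
df$c <- rescale(df$c)
df$d <- rescale(df$d)

sapply(df, range)   # every column now runs from 0 to 1
```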
There is still room for improvement. In this case our dataset has only ten rows, but even min() and max() can take a lot of time to run if we have hundreds of thousands of rows. Thus, we can create an auxiliary variable to compute min() only once:
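A possible version with the auxiliary variable is sketched here; the single-letter name m is an assumption, chosen to match the short variable names discussed next.

```r
rescale <- function(x) {
  m <- min(x, na.rm = TRUE)   # computed once and reused below
  return((x - m) / (max(x, na.rm = TRUE) - m))
}

rescale(c(2, 4, 6, NA))   # 0.0 0.5 1.0 NA
```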
With a function created only once and used in several places, any change to the specifications of the problem can be easily transferred everywhere in the code.
There are a few details that we have disregarded during the creation of our previous function. First, we have created the variables with a single character, rather than using meaningful names. It is time to change that:
What do you think of the new names we have selected for our variables? Do you think they improve its readability?
Next, it is always advisable to write comments in our code. We have implemented this function just now so we are still aware of what is going on. This can change in the future, or another person may want to use our code, so it is important to add comments to ease the comprehension of the code:
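Putting both suggestions together, a version with meaningful names and comments might look like the sketch below. The names values, minimum, maximum and rescaled are illustrative choices, not necessarily the ones used originally.

```r
# Rescale a numeric vector so that it ranges from 0 to 1.
# Missing values are ignored when finding the minimum and maximum.
rescale <- function(values) {
  # Compute each summary only once
  minimum <- min(values, na.rm = TRUE)
  maximum <- max(values, na.rm = TRUE)

  # Shift so the minimum is 0, then divide by the range so the maximum is 1
  rescaled <- (values - minimum) / (maximum - minimum)
  return(rescaled)
}

rescale(c(10, 15, 20))   # 0.0 0.5 1.0
```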
Finally, we need to be careful with our function name. rescale() is a clear name, and it is a good idea to use verbs as function names. However, “rescale” has probably been used by other people in other packages. It is a common word for a common operation. We should either choose a less common word, or put our function in a package (easily done in R, but not covered here) so that we can explicitly call it with my_package::rescale() to avoid any confusion.
In general:
source('my_functions_file.R') to load them.
When to write a function
This is easily the simplest thing to remember, but the hardest to implement
When NOT to write a function
summarise()!
The tilde (~) is used to create a lambda function, and .x is used to refer to the input.
. can also be used to refer to the input - this is the same as .x
. and .x are common conventions
# example use of an anonymous function with summarise
gapminder |>
  group_by(continent) |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  head(3)
is equivalent to
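One way to write the equivalent call is to spell out the anonymous function in full. This is a sketch that assumes the dplyr and gapminder packages are installed; the object names are illustrative.

```r
library(dplyr)
library(gapminder)

# Lambda (tilde) form, as in the example above
by_lambda <- gapminder |>
  group_by(continent) |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  head(3)

# The same calculation with an explicit anonymous function
by_function <- gapminder |>
  group_by(continent) |>
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE))) |>
  head(3)

by_function
```

In R 4.1 and later, `\(x) mean(x, na.rm = TRUE)` is a shorter spelling of the same anonymous function.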
An if statement allows you to conditionally execute code. It looks like this:
To get help you need to surround it in backticks: ?`if`. Be aware that this help page is not particularly helpful if you are not already an experienced programmer. The condition must evaluate to a single TRUE or FALSE. In R, a single & (and) or a single | (or) combines conditions element by element, which is what you need when working with whole vectors or columns, for example inside filter().
It is also possible to use || (or) and && (and) to combine multiple logical expressions. These operators work on single TRUE or FALSE values and are short-circuiting: as soon as || sees the first TRUE it returns TRUE without computing anything else, and as soon as && sees the first FALSE it returns FALSE. This makes them a good fit for the single condition of an if statement, and the knowledge can be quite handy when you become a more experienced programmer. When first starting out though, it can be confusing! At the minimum, be aware that & and && are not the same thing, and that for vectors or columns of data you most likely need to use & or |. If you’re unsure, look it up, and check your output with ‘dummy’ examples to make sure it works as expected.
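The difference between the two families of operators can be checked with a ‘dummy’ example like this one (the vectors are illustrative):

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# & and | work element by element on whole vectors
x & y   # TRUE FALSE FALSE
x | y   # TRUE TRUE TRUE

# || and && work on single values and stop as soon as the answer is known,
# so the stop() calls on the right-hand side are never evaluated
TRUE || stop("never evaluated")    # TRUE
FALSE && stop("never evaluated")   # FALSE
```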
Here is a simple example of a condition within a function:
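As a sketch of what such a function can look like, here is a hypothetical describe_sign() with one condition (the name and messages are assumptions, not the original example):

```r
# Return a label depending on a single condition
describe_sign <- function(x) {
  if (x < 0) {
    return("negative")
  }
  return("non-negative")
}

describe_sign(-3)   # "negative"
describe_sign(5)    # "non-negative"
```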
You can concatenate multiple conditions in a simple structure:
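A chain of conditions can be sketched with else if. The function name and the temperature thresholds below are hypothetical, chosen only to illustrate the structure.

```r
# Each condition is only checked if all earlier ones were FALSE
classify_temperature <- function(celsius) {
  if (celsius < 10) {
    label <- "cold"
  } else if (celsius < 25) {
    label <- "mild"
  } else {
    label <- "hot"
  }
  return(label)
}

classify_temperature(5)    # "cold"
classify_temperature(18)   # "mild"
classify_temperature(30)   # "hot"
```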
But if you need to encode a method that involves a very long series of chained if statements, you should consider rewriting. One useful technique is the switch() function: it allows you to evaluate selected code based on position or name. Here is an example:
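A sketch of switch() selecting code by name and by position follows; apply_operation() and its operation names are hypothetical.

```r
# Pick a calculation by name instead of writing a long if / else if chain
apply_operation <- function(x, y, op) {
  result <- switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op: ", op)   # unnamed last argument is the fallback
  )
  return(result)
}

apply_operation(6, 3, "times")    # 18
apply_operation(6, 3, "divide")   # 2

# With a number, switch() selects by position
switch(2, "first", "second", "third")   # "second"
```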
if-else statements
ifelse() lets us use vectorised if-else statements (note, dplyr has a version called dplyr::if_else() that’s a bit stricter)
switch() function
dplyr::case_when() handles multiple vectorised if_else() statements
Answers not shown here as it is fundamentally interactive!
my_add() that takes two arguments, x and y and returns their sum (do not paste it into the terminal!)
source("path/to/script.R")
my_add(2, 3)
testthat package
my_multiply() that takes two arguments, x and y and returns their product
my_multiply(2, 3) returns 6
readr or fread, but R has a long history of debate and frustration over reading-in strings as factors in base R.